AITopics | multilingual speech recognition

Collaborating Authors

multilingual speech recognition

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems

Nguyen, Tuan, Hoang, Long-Vu, Tran, Huy-Dat

arXiv.org Artificial IntelligenceJul-8-2025

This paper presents our system for the MLC-SLM Challenge 2025, focusing on multilingual speech recognition and language modeling with large language models (LLMs). Our approach combines a fine-tuned Whisper-large-v3 encoder with efficient projector architectures and various decoder configurations. We employ a three-stage training methodology that progressively optimizes the encoder, projector, and LLM components. Our system achieves competitive performance with a private test average WER/CER result of 16.63% using the Gemma3-12B and 18.6% using the Qwen2.5-7B

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2506.13596

Country: Asia > Vietnam (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Efficient Multilingual ASR Finetuning via LoRA Language Experts

Li, Jiahong, Shao, Yiwen, Zhuo, Jianheng, Li, Chenda, Tang, Liliang, Yu, Dong, Qian, Yanmin

arXiv.org Artificial IntelligenceJun-30-2025

Recent advancements in deep learning have significantly enhanced multilingual automatic speech recognition (ASR) due to the development of advanced model architectures and available large-scale multilingual datasets. Despite that, multilingual ASR still suffers from the curse of multilinguality in that different languages tend to interfere with each other, making it difficult for the ASR model to identify multiple languages effectively while sharing model capacity across them. This paper proposes an efficient finetuning framework for customized multilingual ASR via prepared LoRA language experts based on Whisper. Through LoRA expert fusion or knowledge distillation, our approach achieves better recognition performance on target languages than standard fine-tuning methods. Experimental results demonstrate that the proposed models yield approximately 10\% and 15\% relative performance gains in language-aware and language-agnostic scenarios, respectively.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2506.21555

Country:

Asia > China > Shanghai > Shanghai (0.04)
North America > United States (0.04)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models

Chen, William, Tian, Jinchuan, Peng, Yifan, Yan, Brian, Yang, Chao-Han Huck, Watanabe, Shinji

arXiv.org Artificial IntelligenceFeb-14-2025

Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with the 18B version being the largest speech model, to the best of our knowledge. OWLS leverages up to 360K hours of public speech data across 150 languages, enabling a systematic investigation into how data, model, and compute scaling each influence performance in multilingual speech tasks. We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling. One of our key findings is that scaling enhances performance on low-resource languages/dialects, helping to mitigate bias and improve the accessibility of speech technologies. Finally, we show how OWLS can be used to power new research directions by discovering emergent abilities in large-scale speech models. Model checkpoints will be released on https://huggingface.co/collections/espnet/owls-scaling-laws-for-speech-recognition-and-translation-67ab7f991c194065f057ce8d for future studies.

artificial intelligence, multilingual speech recognition, scaling law, (11 more...)

arXiv.org Artificial Intelligence

2502.10373

Country:

North America > United States (0.93)
Asia (0.93)

Genre: Research Report (0.82)

Industry: Law (0.34)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)

Add feedback

Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting

Kashiwagi, Yosuke, Futami, Hayato, Tsunoo, Emiru, Arora, Siddhant, Watanabe, Shinji

arXiv.org Artificial IntelligenceJun-18-2024

End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech. Since the common scenario is where the language is already known, these models can perform as language-specific by using language information as prompts, which is particularly beneficial for attention-based encoder-decoder architectures. However, the Connectionist Temporal Classification (CTC) approach, which enhances recognition via joint decoding and multi-task training, does not normally incorporate language prompts due to its conditionally independent output tokens. To overcome this, we introduce an encoder prompting technique within the self-conditioned CTC framework, enabling language-specific adaptation of the CTC model in a zero-shot manner. Our method has shown to significantly reduce errors by 28% on average and by 41% on low-resource languages.

encoder, recognition, speech recognition, (14 more...)

arXiv.org Artificial Intelligence

2406.12611

Country:

North America > United States (0.04)
Asia > Japan (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.92)

Add feedback

Enhancing Multilingual Speech Recognition through Language Prompt Tuning and Frame-Level Language Adapter

Li, Song, You, Yongbin, Wang, Xuezhi, Ding, Ke, Wan, Guanglu

arXiv.org Artificial IntelligenceSep-19-2023

Ref. [6, 7] introduced an additional language identification (LID) module Multilingual intelligent assistants, such as ChatGPT, have to predict language information, while Ref. [2] treated language recently gained popularity. To further expand the applications information as a special textual token and concatenated of multilingual artificial intelligence (AI) assistants and it to the input of the decoder of the autoregressive speech facilitate international communication, it is essential to enhance recognition model, achieving joint modeling of speech recognition the performance of multilingual speech recognition, and language identification. Ref. [3] provided language which is a crucial component of speech interaction. In this information directly as prior information to speech recognition paper, we propose two simple and parameter-efficient methods: models, this can be achieved by encoding language information language prompt tuning and f rame-level language as a one-hot vector or embedding and concatenating adapter, to respectively enhance language-configurable and it with acoustic features.

language information, recognition, speech recognition, (12 more...)

arXiv.org Artificial Intelligence

2309.09443

Country: Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training

Sun, Eric, Li, Jinyu, Hu, Yuxuan, Zhu, Yimeng, Zhou, Long, Xue, Jian, Wang, Peidong, Liu, Linquan, Liu, Shujie, Lin, Edward, Gong, Yifan

arXiv.org Artificial IntelligenceJul-7-2023

We propose gated language experts and curriculum training to enhance multilingual transformer transducer models without requiring language identification (LID) input from users during inference. Our method incorporates a gating mechanism and LID loss, enabling transformer experts to learn language-specific information. By combining gated transformer experts with shared transformer layers, we construct multilingual transformer blocks and utilize linear experts to effectively regularize the joint network. The curriculum training scheme leverages LID to guide the gated experts in improving their respective language performance. Experimental results on a bilingual task involving English and Spanish demonstrate significant improvements, with average relative word error reductions of 12.5% and 7.3% compared to the baseline bilingual and monolingual models, respectively. Notably, our method achieves performance comparable to the upper-bound model trained and inferred with oracle LID. Extending our approach to trilingual, quadrilingual, and pentalingual models reveals similar advantages to those observed in the bilingual models, highlighting its ease of extension to multiple languages.

artificial intelligence, information, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2303.00786

Genre: Research Report (0.83)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.70)

Add feedback

Language-Universal Adapter Learning with Knowledge Distillation for End-to-End Multilingual Speech Recognition

Shen, Zhijie, Guo, Wu, Gu, Bin

arXiv.org Artificial IntelligenceFeb-28-2023

In this paper, we propose a language-universal adapter learning framework based on a pre-trained model for end-to-end multilingual automatic speech recognition (ASR). For acoustic modeling, the wav2vec 2.0 pre-trained model is fine-tuned by inserting language-specific and language-universal adapters. An online knowledge distillation is then used to enable the language-universal adapters to learn both language-specific and universal features. The linguistic information confusion is also reduced by leveraging language identifiers (LIDs). With LIDs we perform a position-wise modification on the multi-head attention outputs. In the inference procedure, the language-specific adapters are removed while the language-universal adapters are kept activated. The proposed method improves the recognition accuracy and addresses the linear increase of the number of adapters' parameters with the number of languages in common multilingual ASR systems. Experiments on the BABEL dataset confirm the effectiveness of the proposed framework. Compared to the conventional multilingual model, a 3.3% absolute error rate reduction is achieved. The code is available at: https://github.com/shen9712/UniversalAdapterLearning.

adapter, artificial intelligence, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2303.01249

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > China > Anhui Province > Hefei (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Adapt-and-Adjust: Overcoming the Long-Tail Problem of Multilingual Speech Recognition

Winata, Genta Indra, Wang, Guangsen, Xiong, Caiming, Hoi, Steven

arXiv.org Artificial IntelligenceDec-2-2020

One crucial challenge of real-world multilingual speech recognition is the long-tailed distribution problem, where some resource-rich languages like English have abundant training data, but a long tail of low-resource languages have varying amounts of limited training data. To overcome the long-tail problem, in this paper, we propose Adapt-and-Adjust (A2), a transformer-based multi-task learning framework for end-to-end multilingual speech recognition. The A2 framework overcomes the long-tail problem via three techniques: (1) exploiting a pretrained multilingual language model (mBERT) to improve the performance of low-resource languages; (2) proposing dual adapters consisting of both language-specific and language-agnostic adaptation with minimal additional parameters; and (3) overcoming the class imbalance, either by imposing class priors in the loss during training or adjusting the logits of the softmax output during inference. Extensive experiments on the CommonVoice corpus show that A2 significantly outperforms conventional approaches.

adapter, adjustment, speech recognition, (15 more...)

arXiv.org Artificial Intelligence

2012.01687

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Asia > Taiwan (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (0.64)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Speech Recognition With No Speech Or With Noisy Speech Beyond English

Krishna, Gautam, Tran, Co, Han, Yan, Carnahan, Mason, Tewfik, Ahmed H

arXiv.org Machine LearningJun-17-2019

In this paper we demonstrate continuous noisy speech recognition using connectionist temporal classification (CTC) model on limited Chinese vocabulary using electroencephalography (EEG) features with no speech signal as input and we further demonstrate single CTC model based continuous noisy speech recognition on limited joint English and Chinese vocabulary using EEG features with no speech signal as input.

artificial intelligence, recognition, speech recognition, (13 more...)

arXiv.org Machine Learning

1906.08045

Country:

North America > United States > Texas > Travis County > Austin (0.14)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Diagnostic Medicine (0.67)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.35)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)

Add feedback